DATA1220-55, Fall 2024
2024-10-25
If we assume \(H_0\): Independence, how can we estimate the data expected under the null hypothesis?
The Multiplication Rule for Independent Events
The probability of event A and event B occurring is the product of the probability that A occurs and the probability that B occurs.
|  | B | B’ | Total |
|---|---|---|---|
| A | P(A and B) | P(A and B’) | P(A) |
| A’ | P(A’ and B) | P(A’ and B’) | P(A’) |
| Total | P(B) | P(B’) | 1 |
Using proportions…
\[ \operatorname{Expected}_{\operatorname{A and B}}=P(A) \times P(B) \times n \]
Using counts…
\[ \operatorname{Expected}_{\operatorname{A and B}}=\frac{\operatorname{count}(A) \times \operatorname{count}(B)}{n} \]
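As a sketch, the expected counts can be computed from the row and column totals of a hypothetical 2×2 table (the counts here are made up for illustration):

```python
# Hypothetical 2x2 table of observed counts (rows: A / A', cols: B / B')
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]        # count(A), count(A')
col_totals = [sum(col) for col in zip(*observed)]  # count(B), count(B')
n = sum(row_totals)                                # total sample size

# Expected count for each cell under H0: count(row) * count(col) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)  # [[20.0, 30.0], [20.0, 30.0]]
```

Note that the expected table has the same row and column totals as the observed table; only the cell counts are redistributed to match independence.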
The Chi-Square (\(\chi^2\)) test statistic is the sum of the squared difference between observed and expected value divided by the expected value for all combinations of categories.
\[ \chi^2_{df} =\sum^k_{i=1} \frac{{\left( \operatorname{observed} - \operatorname{expected} \right)}^2}{\operatorname{expected}} \]
For a two-way table, the degrees of freedom for a \(\chi^2\) test statistic are…
\[ \begin{aligned} df &= \left( n_{\operatorname{rows}} - 1 \right) \times \left( n_{\operatorname{cols}} - 1 \right) \\ &= \left( R-1 \right) \times \left( C-1 \right) \end{aligned} \]
…where \(R\) is the number of rows in the table and \(C\) is the number of columns.
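Putting the statistic and the degrees of freedom together, here is a sketch for a hypothetical 2×2 table (observed counts 30, 20, 10, 40, with expected counts computed from the row and column totals as above):

```python
import math

# Hypothetical 2x2 table: observed counts and their expected counts
# under independence (expected = row total * column total / n)
observed = [30, 20, 10, 40]
expected = [20.0, 30.0, 20.0, 30.0]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = (2 - 1) * (2 - 1)  # (R - 1) x (C - 1) for a 2x2 table

# For df = 1 the upper-tail probability has a closed form via erfc;
# in general you would use pchisq(chi_sq, df, lower.tail = FALSE) in R.
p_value = math.erfc(math.sqrt(chi_sq / 2))
print(round(chi_sq, 2), df)  # 16.67 1
```

With a statistic this far into the upper tail, the p-value is well below 0.05, so we would reject the null hypothesis of independence for this hypothetical table.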
You can find p-values in R using the pchisq() function, which takes a test statistic (q) and the degrees of freedom (df) as parameters.
For the \(\chi^2\) test, we always use the upper tail — the probability of seeing a result at least as extreme as what we observed — so call pchisq(q, df, lower.tail = FALSE) (equivalently, 1 - pchisq(q, df)).
The \(\chi^2\) test is most appropriate for large sample sizes.
When sample sizes are small, use Fisher’s Exact Test.
Neither the \(\chi^2\) test nor Fisher’s Exact Test will tell you the nature of the relationship between your two categorical variables.
Additional tests are needed to determine if the outcomes are dependent on variable 1, variable 2, or both.
You can use the margin of error calculation to estimate the sample size needed to detect a given difference in proportions.
\[ \text{margin of error} = Z^* \times \sqrt{\frac{p(1-p)}{n}} \]
We want to know if people favor candidate 1 or candidate 2 (\(H_0\): \(P(C_1)=P(C_2)\)), but it will be a very close race. If we want to find 52% for candidate 1 vs 48% for candidate 2, what size sample do we need?
What margin of error do we need?
b. 2% — to distinguish 52% from 48%, the confidence interval around 52% must not cross 50%, so the margin of error can be at most 2%.
If we need a margin of error of 2%, and we want to have 95% confidence in our results, we can solve for \(n\) to find the minimum sample size needed.
\[ \begin{aligned} 0.02 &= 1.96 \times \sqrt{\frac{0.52(1-0.52)}{n}} \\ \frac{0.02}{1.96} &= \sqrt{\frac{0.2496}{n}} \\ \left( \frac{0.02}{1.96} \right)^2 &= \frac{0.2496}{n} \\ n &= 0.2496 \times \left( \frac{1.96}{0.02} \right)^2 \approx 2397 \end{aligned} \]
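As a numeric check of this algebra (same values as the worked example):

```python
import math

# Solve margin_of_error = Z* x sqrt(p * (1 - p) / n) for n
z = 1.96    # critical value Z* for 95% confidence
p = 0.52    # anticipated proportion for candidate 1
moe = 0.02  # required margin of error

n_exact = (z / moe) ** 2 * p * (1 - p)
n_needed = math.ceil(n_exact)  # round up so the margin is guaranteed
print(round(n_exact, 1), n_needed)  # 2397.2 2398
```

Since the exact solution is not a whole number, in practice we round up, so a sample of at least 2398 people is needed.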
Probabilistic samples have less error than non-probabilistic samples
Results coming out are only as good as the data going in
Can you reliably, validly, generalizably describe the US when \(n=1000\)? \(n=2000\)?
Interviewer
Respondent
Survey
Combining results from multiple surveys may improve accuracy
Are all polls created equal?
If the underlying assumptions are faulty, more data won’t improve the quality
Transparency
Point estimates? Confidence intervals?
Certainty
Poll’s sponsor and data collection firm
Participant selection process
Interview methods and dates
Sample sizes, non-response rates
Question phrasing
Weighting methods
DATA1220-55 Fall 2024, Class 22 | Updated: 2024-10-25